Materials and Methods Relating to Disease Diagnosis



Field of the Invention 

The present invention concerns materials and methods
relating to disease diagnosis. Particularly, but not
exclusively, the invention relates to methods of
diagnosing tumours by comparing specific patterns of
gene expression at a nucleic acid or protein level using
expressed nucleic acid, e.g. mRNA, or cellular proteins
associated with the tumour.

Background of the Invention

The major characteristics that differentiate
malignant tumours from benign ones are their properties
of invasiveness and spread. Malignant tumours do not
remain localised and encapsulated: they invade
surrounding tissues, get into the body's circulatory
system, and set up areas of proliferation away from the
site of their original appearance. When tumour cells
spread and engender secondary areas of growth, the
process is called metastasis; malignant cells have the
ability to metastasize.

The earliest stages of malignant tumours are hard to 
identify and pathologists are rarely sure how or where a 
malignancy began. The cells of malignant tumours have a 
tendency to lose differentiated traits and therefore it 
can be difficult to determine the primary origin of the
cells following metastasis.

A concern with the histopathologic assessment of
neoplasias (tumour growth) is that tumour classification
is based on subjective evaluation (1, 2). Immunostaining
can be used to determine the expression of various
diagnostic markers and may increase reproducibility.
Ovarian cancer is an example of a disease where the
diagnostic difficulties are considerable (3). Epithelial
neoplasias of ovarian cancers are classified into benign,
borderline and malignant tumours. Borderline tumours are
often difficult to diagnose, and it is not known if most
of these tumours represent intermediate steps in tumour
progression or whether these tumours should be considered
as a separate group (4). Relative survival decreases with
increasing tumour stage or grade. Five-year survival is
considerably lower for women with carcinoma (38%) than
for women with borderline carcinoma (95%).

Summary of the Invention

The present inventors have appreciated that carrying
out routine tumour diagnosis in an accurate and objective
manner is very difficult. The process is preoperatively
dependent on an experienced cytologist and/or
postoperatively dependent on an experienced pathologist,
and is at present based on morphological judgements.
Further, the primary tumour source can be difficult to
determine, which may lead to misdiagnosis and an
inappropriate treatment regime. Therefore the present
inventors have realised that there is a need for a
diagnostic tool that can perform preoperative diagnosis
objectively. Such a tool should help to reduce the
number of patients undergoing unnecessary and expensive
therapy.

Multivariate analysis of the expression of a series
of diagnostic markers is one approach to diagnostic
problems. If a sufficiently large data set is collected,
it may be possible to recognize patterns of expression in
different histological groups. Goldschmidt et al. (5)
showed that multivariate analysis of 47 histological
variables generated by computer-assisted microscope
analysis facilitated classification of adipose tumours.
Similarly, multivariate analysis of RNA expression data
has been used to discriminate between fibroblast subtypes
(6).

One approach to obtaining a large data set is to use
high resolution two-dimensional polyacrylamide gel
electrophoresis (2-DE). This technique is able to resolve
more than one thousand polypeptides on a single gel. The
pattern can be analysed by computer software such as
PDQUEST and MELANIE II (7, 8). This approach has been
previously used for the classification of lung tumour
cell lines (9).

An alternative approach to obtaining a large data
set is to use micro-array technology. Nucleic acid



WO 00/70340



PCT/EP00/04265



sequences characteristic of nucleic acid sequences
expressed in certain cell types, e.g. mRNA or cDNA, can
be analysed in this way. There is an increasing tendency
towards miniaturisation of assays which use binding
members (such as antibodies or nucleic acid sequences).
For example, the binding members may be immobilised in
small discrete locations (microspots) and/or as arrays
(micro-array technology) on solid supports or on
diagnostic chips. These approaches can be particularly
valuable as they can provide great sensitivity
(particularly through the use of fluorescent labelled
reagents), require only very small amounts of biological
sample from individuals being tested and allow a variety
of separate assays to be carried out simultaneously.
Examples of techniques enabling miniaturised technology
are provided in WO84/01031, WO88/1058, WO89/01157,
WO93/8472, WO95/18376, WO95/18377, WO95/24649 and
EP-A-0373203.

Early research by Fodor et al. established that
silicon could serve as a substrate onto which organic
molecules such as DNA could be synthesized. The process
now commercialised by Affymetrix Inc., Santa Clara,
California, involves the use of serial photolithographic
steps to build oligonucleotides in situ at a specific
addressable position on the chip.

The strategy of addressing specific nucleic acid
sequences synthesized off chip, then hybridized to a
particular location on a chip by electrical attraction to
a charged microelectrode, has been developed by Nanogen
Inc. Variation on the theme of microaddressable arrays
has recently led to the evaluation of chips for sequence
analysis of uncharacterised genetic material, mutational
analysis of a known gene locus, and for the evaluation of
a particular cell or tissue's profile of gene expression
for the whole complement of the human DNA sequence. These
methodologies typically rely on the use of laser
activated fluorescence of addressable signals on a
microchip.

Thus, at its most general, the present invention
provides materials and methods for, firstly, obtaining a
number of protein or nucleic acid expression profiles
characteristic for disease states of different origins or
different stages of development or malignancy; secondly,
analysing said expression profiles in order to determine
specific diagnostic markers; and thirdly, diagnosing the
presence of a disease, e.g. tumour, the type of disease
or the stage of development of said disease, e.g. tumour
malignancy, by comparison of its protein or nucleic acid
expression profile with those previously obtained, using
the specified diagnostic markers.

Thus, the present invention primarily relates to a
method of obtaining gene expression profiles in order to
determine diagnostic markers characteristic of a selected
disease type or stage of development of a disease
comprising

(1) obtaining cells from a sample of said disease
tissue;

(2) disrupting cells to expose the cellular products
characteristic of gene expression;

(3) separating said cellular products according to
their characteristic properties on a substrate; and

(4) carrying out computer-assisted multivariate
analysis of the substrate to quantify and characterise
the cellular product distribution on the substrate to
identify specific diagnostic markers characteristic of
said disease.

Depending on the cell type, different genes are
expressed or are expressed at different levels or
frequency. These differences in gene expression may be
used to characterise the type of cell. The cellular
products that reflect the differences in gene expression
are those products produced downstream of the nucleic
acid transcription and translation process, e.g. mRNA or
the expressed protein itself. These cellular products may
then be separated according to their own characteristic
properties, e.g. size, charge or sequence.

In a preferred embodiment of the invention, the
cellular products are expressed proteins which may be
separated according to their size on an electrophoresis
gel, preferably a two-dimensional electrophoresis gel.
Alternatively, the cellular products may be
separated according to their characteristic properties
using a substrate comprising specific binding members,
for example, antibodies or oligonucleotides. As mentioned
above, this is conveniently done by using a micro-array.
In such a situation, it is preferable to label the
cellular products, e.g. radioactively, fluorescently or
enzymatically, to assist in the computer-assisted
multivariate analysis.

Therefore, in a first aspect, the present invention
provides a method of obtaining protein expression
profiles in order to determine diagnostic markers
characteristic of selected disease types or stages of
disease development comprising

(1) obtaining cells from a sample of said disease
type;

(2) disrupting cells to expose the cellular proteins
contained therein;

(3) separating said cellular proteins using a
two-dimensional electrophoresis gel; and

(4) carrying out computer-assisted multivariate
analysis of the two-dimensional electrophoresis gel to
quantify and characterise the protein distribution on the
gel to identify specific diagnostic markers
characteristic of said disease.

In order to carry out the analysis as outlined in
step (4), quantitative and qualitative data from the
two-dimensional electrophoresis gel is firstly obtained.
Thus, step (4) may require carrying out multivariate
analysis of the quantitative and qualitative data from
the two-dimensional gel to characterise the protein
expression profile and identify specific diagnostic
markers characteristic of said disease.

In an alternative first aspect of the present
invention, there is provided a method of obtaining gene
expression profiles in order to determine diagnostic
markers characteristic of selected disease types or
stages of disease development, said method comprising

(1) obtaining cells from a sample of said disease
type;

(2) disrupting cells to obtain the expressed nucleic
acid contained therein;

(3) separating said expressed nucleic acid using a
micro-array; and

(4) carrying out computer-assisted multivariate
analysis of the micro-array to quantify and characterise
the expressed nucleic acid on the micro-array to identify
specific diagnostic markers.

The expressed nucleic acid is preferably mRNA, which
may be obtained from the cells by standard molecular
techniques known to the skilled person (for example see
Sambrook, Fritsch and Maniatis, "Molecular Cloning, A
Laboratory Manual", Cold Spring Harbor Laboratory Press,
1989, and Ausubel et al., Short Protocols in Molecular
Biology, John Wiley and Sons, 1992). Alternatively, cDNA
may be created from the expressed mRNA by reverse
transcription before separation and analysis on the
micro-array. Micro-array technologies use
oligonucleotides (representing thousands of different
genes) bound to given positions on various substrates.
Total mRNA is purified from a cell/tissue sample and cDNA
is produced by reverse transcriptase. Various steps (e.g.
in vitro transcription using biotinylated nucleotides)
may then be added before hybridisation and visualisation,
depending on the specific type of micro-array technology
used (e.g. Affymetrix chips, Clontech membranes). The
final read-out is a signal that is proportional to the
quantity of a given expressed gene.

The present inventors have discovered that proteins
are differently expressed or differentially regulated
between various malignant tumours and benign tumours.

Therefore, the inventors believe that the present
invention will have particular utility in relation to the
diagnosis of tumours. Although the following description
of the invention concentrates on the diagnosis of tumours
in general, it will be appreciated by the skilled person
that the present invention may equally and advantageously
be applied to the diagnosis of other disease states
characterised by gene expression profiles, e.g.
hypo/hyperthyroidism, diabetes, or organ rejection.
Further, the invention may be used to test plasma samples
for leukaemia or other haematopoietic disorders.

In previous studies carried out by the present
inventors, a large degree of heterogeneity in protein
expression was observed, particularly in malignant
tumours (17, 18). Both qualitative and quantitative
differences were found within each tumour group.
The large quantitative variability indicated
that identification based on pattern recognition would be
difficult. However, the present inventors show herein
that it is possible to select a subset of variables which
show a characteristic pattern within the group, and thus
are useful for prediction of the presence of malignant
cells and their initial origin.

Thus, in a second aspect of the present invention,
there is provided a method of creating a collection of
diagnostic markers based on protein expression levels for
use in classifying disease cells in a given sample,
comprising

(1) obtaining cells from a plurality of samples of a
selected disease type;

(2) disrupting the cells to expose the cellular
proteins contained therein;

(3) separating the cellular proteins according to
their size on a two-dimensional electrophoresis gel for
each of said plurality of samples of a selected disease
type;

(4) scanning said two-dimensional electrophoresis
gels to collect image data for each of the plurality of
samples of a selected disease type; and

(5) analysing said image data in order to identify
one or more markers characteristic of said selected
disease type.

In an alternative second aspect of the present
invention, there is provided a method of creating a
collection of diagnostic markers based on nucleic acid
expression levels for use in classifying disease cells in
a given sample, comprising

(1) obtaining cells from a plurality of samples of a
selected disease type;

(2) disrupting the cells to obtain the expressed
nucleic acid sequences contained therein;

(3) separating the expressed nucleic acid sequences
according to their nucleotide sequence using micro-array
technology for each of said plurality of samples of a
selected disease type;

(4) scanning said micro-array to collect image data
for each of the plurality of samples of a selected
disease type; and

(5) analysing said image data in order to identify
one or more markers characteristic of said selected
disease type.

Again, the disease type is preferably cancer,
wherein a plurality of samples may be collected from
tumours of a particular cancer, e.g. ovarian, breast,
skin etc., and its gene expression profile characterised
by the present invention.

It is important that the scanning of the
electrophoresis gel or the micro-array easily identifies
the separated proteins or nucleic acids respectively.
Therefore the method may further comprise the step of
labelling the obtained proteins or expressed nucleic
acids. Nucleic acid sequences may be labelled by
standard techniques known to the skilled person such as
fluorescent, enzyme or radioactive labelling. As an
alternative to labelling obtained proteins, the gels may
be stained with, for example, silver nitrate, and scanned
using a laser densitometer. Alternatively, the gels may
be analysed using a computer-assisted microscope to
facilitate classification. The data are obtained and a
statistical comparison may be performed. In particular,
this is preferably a multivariate characterisation of one
or more numerical parameters associated with the
proteins. In other words, a plurality of variables
generated by, for example, computer-assisted image
analysis may be easily classified by multivariate
analysis. The statistical comparison may, for example,
identify a sub-set of proteins, from among all of the
proteins on the 2-DE, having a statistically significant
degree of expression and/or correlation when compared to
other samples from similar tumour cells. This sub-set of
proteins may then be used as diagnostic markers for the
particular tumour or stage of malignancy. Preferably, a
plurality of 2-DE gels are analysed and the distribution
pattern of the proteins is determined. A model may then
be set up with a specified number of variables between
the tumour cells being analysed. For example, a
comparison may be made between
benign/borderline/malignant. Preferably the number of
variables separating the groups, whether proteins or
expressed nucleic acid sequences, will range between 20
and 500, more preferably 50 and 300, even more preferably
100 and 200. In general, it is preferable that the
number of variables is at least 20, more preferably at
least 50 and even more preferably at least 70, 100 or 150
variables. In the present case, the inventors used 170
variables.

Quantification and multivariate characterisation of
the expression profiles of selected protein or nucleic
acid groups may be performed on image analytical data
obtained from analysis of the 2-DE or the micro-array
respectively and used for objective classification of the
tumour cells in a given sample. The multivariate
characterisation may be carried out by partial least
squares discriminant analysis (PLS-DA). This process
allows (i) the construction and characterisation of a
protein or nucleic acid expression profile database and
data extraction of a plurality of sets of proteins or
nucleic acids which contribute significantly to the
diagnosis/classification of a disease state; (ii) the
addition of samples/protein or nucleic acid expression
profiles to the database, further improving the future
accuracy of the diagnosis/classification; and (iii)
querying of the database via the expert system using new
tumour samples/protein or nucleic acid expression
patterns, aiming at a prediction of diagnosis.

A protein expression profile database comprising
image data which has been analysed in order to determine
a plurality of variables for use as diagnostic markers,
said data being obtained from analysis of two-dimensional
electrophoresis gels showing characteristic protein
distribution associated with a disease type or stage of
development of said disease, for use in disease diagnosis
forms another aspect of the present invention.

A nucleic acid (mRNA or cDNA) expression profile 
database comprising image data which has been analysed in 
order to determine a plurality of variables for use as 
diagnostic markers; said data being obtained from 
analysis of a micro-array showing characteristic 
expressed nucleic acid sequence distribution associated 
with a disease type or stage of development of said 
disease, for use in disease diagnosis forms yet another 
aspect of the present invention. 

In a further aspect, the present invention provides 
a method of determining the presence, type or stage of a 
disease type in a patient comprising the steps of 

(1) extracting a sample of candidate disease cells
from the patient;

(2) disrupting the cells so as to expose the 
cellular proteins contained therein; 

(3) separating said cellular proteins on a two- 
dimensional electrophoresis gel; and 

(4) analysing said gel by computer-assisted image
evaluation so as to compare protein distribution on the
gel with a database of diagnostic markers characteristic
of a plurality of disease types or stages of disease
development to determine presence, type or risk of said
disease in said patient.

The present invention also provides a method of
determining the presence, type or stage of a disease in a
patient comprising the steps of

(1) extracting a sample of candidate disease cells
from a patient;

(2) disrupting the cells so as to obtain the
expressed nucleic acid sequences contained therein;

(3) separating said expressed nucleic acid sequences
on a micro-array according to their nucleotide sequence;
and

(4) analysing said micro-array by computer-assisted
image evaluation so as to compare expressed nucleic acid
distribution on said micro-array with a database of
diagnostic markers characteristic of a plurality of
disease types or stages of disease development to
determine presence, type or risk of said disease in said
patient.

Preferably, the disease type is cancer and the
disease cells are tumour cells. 

Sample preparation may be carried out using standard
techniques. One typical sample may contain approximately
one million cells. Samples may be collected using fine
needle aspiration biopsy (FNA), a routine technique
used for cytological diagnosis. The major advantages of
using FNA combined with the expert system are that (i)
early diagnosis is possible, a prerequisite for making
early decisions on therapy; (ii) effects of hormone
therapy or chemotherapy can be followed at protein
expression level, providing early information on e.g.
resistance against treatment; and (iii) the analysis is
based on an average expression profile of the cell
population.

Samples may also be collected after surgery for 
analysis in order to guide pathological examination and 
selection of post-operation therapeutic strategy. 

As mentioned above, the earliest stages of malignant
tumours are hard to identify and pathologists are rarely
sure how or where a malignancy began. The present
invention therefore has further utility in being able to
more accurately determine the primary origin of tumour
cells as the primary tumour and its corresponding
metastasis express very similar 2-DE protein profiles
(Franzen et al. Int. J. Cancer 1996, 69, 408-414). Such
analysis will therefore assist a clinician in determining
the location of the primary tumour.

The above disclosure concentrates on the analysis
and diagnosis of tumours. However, as mentioned above,
the present invention may also be usefully applied to the
diagnosis of any disease state that can be characterised
by a statistically significant protein expression profile
which allows the identification of specific diagnostic
markers.

By way of example only, a brief outline/workflow on 
how the computer analysis may be set up in practice is 
provided below: 

1. A new tumour sample is prepared, analyzed by 2-DE
and the expression pattern is scanned.

2. All protein spots in this expression pattern are
quantified and matched against a reference pattern
using any established software for basic 2-DE
analysis (e.g. PDQuest, Melanie, BioImage).

3. The data is first organized in an Excel
spreadsheet-like table with all protein spot reference
numbers in the first column and individual
normalized protein quantities for every analyzed
sample in the following columns. A new case/pattern
is added as a new column. This corresponds to the
"data table X".

4. The process of "data mining", to find those
proteins/variables which contribute most to the
separation of tumour classes and build the
learning set (the core of the database), is based on
the PLS-DA analysis. Here, an additional "data
table Y" is included, as described under materials
and methods, data preprocessing (please see also
references 14 & 15). Graphically and numerically it
is possible to make a first selection of variables
(those that are far from the origin (compare Fig. 4) in
the same and opposite direction from the
corresponding position of the tumour classes, compare
Fig. 3).

5. In an interactive sub-routine or process, this first
set of variables is cross-validated by excluding
cases one by one in sequence, rebuilding the model and
making a prediction of each of the excluded cases.
Then, a second set of variables is selected
(according to step 4), and so on until the
predictive value reaches an optimum. In the present
case, a set of 170 variables was selected in this
way (steps 4 and 5) and is therefore not a random
choice.

6. Next, the true predictive value is determined using
a new set of cases (the test set).

7. This process, steps 3-6, can then be repeated with an
increased number of cases in order to further
improve the predictive accuracy.

8. A new case (an unknown tumour sample) is then
analyzed by 2-DE/basic image analysis, the pattern
is compared with respect to the defined group of
variables in the database model and classified
using, for example, PLS-DA prediction in order to
obtain a diagnosis. Each new case may also be added
to the database for future improvements of the
predictive value of the model.

One part of the expert system/computer software is
to integrate steps 3 to 7 and make the process
user-friendly in order to guide the investigator towards the
construction of a model within the database which
provides high predictive accuracy. The other part of the
expert system/computer software is to facilitate the
query of the model using a new case in order to obtain a
diagnosis (step 8 above). In addition to these
"calculation parts" of the expert system, information may
be included on sample preparation and on sample
characteristics, 5-year survival data etc.

Thus, in a further aspect of the present
invention, there is provided a diagnostic kit for
diagnosing the presence, type or stage of a disease, e.g.
a tumour or malignancy of a tumour, said kit comprising a
database capable of quantifying a protein or nucleic
acid expression pattern and comparing it against
reference patterns held within the database. The kit may
also optionally include instructions for carrying out
any of the methods described above; apparatus for
carrying out a 2-DE; micro-array technology; or a laser
densitometer or other image scanning device.

Aspects and embodiments of the present invention
will now be illustrated, by way of example, with
reference to the accompanying figures. Further aspects
and embodiments will be apparent to those skilled in the
art. All documents mentioned in this text are
incorporated herein by reference.

Brief Description of the Drawings 

Fig. 1 The two first principal components scores
(t2 against t1) of the 2-DE training data-set (22 gels and
1553 spots). A = benign ovary tumour sample (open
circles), B = borderline ovary tumour sample (mixed
circles), and C = malignant ovary tumour sample (filled
circles).

Fig. 2 The two first principal components scores
(t2 against t1) of the most informative part of the 2-DE
training data-set (22 gels and 170 spots). For
descriptions, see Fig. 1.

Fig. 3 The two first PLS-DA scores (tPS2 against
tPS1) of the entire 2-DE data (40 gels and 170 spots). The
samples in the learning set are indicated using circles
(A = benign ovary tumour sample (open circles), B =
borderline ovary tumour sample (mixed circles) and C =
malignant ovary tumour sample (filled circles)). The
samples in the test set are indicated using filled, mixed
and open squares in analogy with the learning set.

Fig. 4 The corresponding loading plot to Fig. 3
(wc2 against wc1). Indicated are the loading scores for
the most significant spots for separation of the three
tumour classes.

Fig. 5 The two first principal components scores (t2
against t1) of breast tumour samples (33 gels and 170
spots). Cases classified as carcinoma are labelled "C"
and have filled symbols; cases classified as fibroadenoma
are marked with "FA" and have open symbols.

Detailed Description 

1) MATERIALS AND METHODS

Tumour tissue samples 

All samples were obtained within 40 min after
resection and tumour cells were enriched as previously
described (10). Histopathological characterization was
carried out using hematoxylin-eosin stained sections of
formalin fixed and paraffin embedded specimens. Tumours
were classified using the WHO system.






Electrophoresis, scanning and image analysis

2-DE was performed as previously described (11).
Resolytes (2%, pH 4-8, BDH) were used for isoelectric
focussing; 10-13% linear gradient SDS-polyacrylamide gels
were used in the second dimension. Gels were stained with
silver nitrate as described by Rabilloud et al. (12) and
scanned at 100 µm resolution using a Molecular Dynamics
laser densitometer. Data was analysed using PDQUEST™
software (7) obtained from Pharmacia Biotech (Uppsala,
Sweden).

Data preprocessing 

The data from the matchset was exported from the PDQUEST
gel analysis package in the form of tables, with rows
representing gels and columns representing spots (data
table X; see references 14 and 15). Before the analysis,
the data was standardized by dividing each variable (table
column) by its standard deviation, thereby giving each
variable the same influence in the analysis. Thereafter the
data was centred by subtracting from each column its
average.
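The preprocessing described above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only, not the inventors' software: the function name `autoscale` is hypothetical, and the use of the sample standard deviation (`ddof=1`) and the handling of constant columns are assumptions that the text does not specify.

```python
import numpy as np

def autoscale(X):
    """Scale each column (spot) to unit standard deviation, then centre it.

    Rows are gels, columns are spots, as in data table X.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)      # assumption: sample standard deviation
    std[std == 0] = 1.0              # assumption: leave constant columns unscaled
    scaled = X / std                 # every variable gets the same influence
    return scaled - scaled.mean(axis=0)  # subtract the column average

if __name__ == "__main__":
    # Three gels, two spots (made-up quantities for illustration only)
    table_x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
    print(autoscale(table_x))
```

After this step every column has zero mean and unit standard deviation, so an abundant spot does not dominate a faint one in the subsequent analysis.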



Data analysis

The preprocessed data table (data table X) was
analysed by two data analysis methods. The first one,
Principal Component Analysis (PCA), extracts the
information in the data in the form of eigenvectors or
principal components. Visually, one can see this as a cloud
of points (the individual cases/gels) in a
multidimensional space (each axis representing one
spot). PCA first centres the data. Secondly, it rotates the
data in such a way that the greatest amount of linear
variation is described by the first component axis, the
greatest part of the residual variation is described by the
second component axis, and so on. Most of the information is
often compressed into two or three components. A more
detailed description of PCA may be found elsewhere (13).
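As a hedged illustration of the centring and rotation just described, the following Python sketch computes principal component scores by singular value decomposition. It is not the CODEX or SIMCA implementation; the function name `pca_scores` and its return values are assumptions made for this example.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return scores, loadings and explained-variance fractions.

    Rows of X are cases (gels); columns are variables (spots).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # PCA first centres the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # case coordinates (t1, t2, ...)
    loadings = Vt[:n_components].T                # directions of the component axes
    explained = (s ** 2) / np.sum(s ** 2)         # fraction of variation per component
    return scores, loadings, explained[:n_components]

if __name__ == "__main__":
    # Four cases lying on a straight line: one component carries everything
    demo = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
    t, p, frac = pca_scores(demo)
    print(frac)
```

Plotting the second score column against the first gives the kind of overview referred to in the figure descriptions (t2 against t1).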

The second data analysis method, Partial Least
Squares (PLS) Discriminant Analysis, was used to
classify the cases into the three tumour classes (benign,
borderline or malignant). An additional data table (data
table Y) with the classification of the tumours is included
in the analysis. Table Y consists of the same number of
columns as the number of tumour classes and the number of
rows is equal to the number of cases. The table is then
filled with suitable dummy variables (i.e. 1 = belongs to
a specific tumour class or 0 = does not belong).
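The layout of data table Y can be made concrete with a short Python sketch. The helper name `dummy_table` and the class labels "A", "B" and "C" (benign, borderline, malignant) are used here for illustration only.

```python
import numpy as np

def dummy_table(labels, classes):
    """Build data table Y: one row per case, one column per tumour class.

    An entry is 1 if the case belongs to that class and 0 otherwise.
    """
    labels = list(labels)
    classes = list(classes)
    Y = np.zeros((len(labels), len(classes)))
    for row, label in enumerate(labels):
        Y[row, classes.index(label)] = 1.0
    return Y

if __name__ == "__main__":
    # Four hypothetical cases: benign, malignant, borderline, malignant
    print(dummy_table(["A", "C", "B", "C"], ["A", "B", "C"]))
```

Each row contains exactly one 1, so the table has as many rows as cases and as many columns as tumour classes.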

The PLS analysis is similar to PCA in that it projects
the data table X onto a vector. It differs, however, in
that the direction of the vector is determined both by the
variation of data table X (as in the case of PCA) and by
the variation of data table Y. For further descriptions
of PLS, see (14, 15). The significance of the PLS model is
checked by cross-validation. Data from a small number of
samples are kept out of the calculation, the PLS model is
computed from the remaining data, and the y-values of the
deleted samples are thereafter predicted from the model. The
squared differences between predicted and actual y-values
for the deleted samples are summed to form PRESS (Predictive
Error Sum of Squares). This sequence is repeated until each
sample has been deleted once.
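
The cross-validation sequence described above may be sketched as
follows. To keep the example short, a one-variable least-squares line
is used as a stand-in for the PLS model; this substitution, the
function names and the example data are assumptions. The quantity Q2
is computed here as 1 - PRESS/SS, where SS is the sum of squares of
the y-values about their mean. This is a common definition and is
assumed here, since the text does not define Q2.

```python
def fit_line(xs, ys):
    """Stand-in model (NOT PLS): least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def press(xs, ys, fit=fit_line, group_size=1):
    """Delete each group of samples once, refit, and sum squared prediction errors."""
    total = 0.0
    for start in range(0, len(xs), group_size):
        held_out = set(range(start, min(start + group_size, len(xs))))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in held_out)
    return total


def q2(xs, ys, fit=fit_line, group_size=1):
    """Cross-validated explained variation: 1 - PRESS / SS (assumed definition)."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press(xs, ys, fit, group_size) / ss_total


# Hypothetical data: one predictor and a 0/1 dummy response.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
press_value = press(x, y)
q2_value = q2(x, y)
```

A model that predicts the deleted samples perfectly gives PRESS = 0
and Q2 = 1; poorer predictions give a larger PRESS and a lower Q2.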

The data table used for training the PLS model
consists of 22 cases and 170 spots (Table X). To test the
model, a table (18 cases and 170 spots) with unknown tumour
class was used (Table X).

The data analyses were carried out with CODEX™ software
obtained from Sumit System AB (Stockholm, Sweden) and
SIMCA™ software obtained from Umetri AB (Umeå, Sweden).

2) RESULTS 



Creation of a Learning Set 

Cells were extracted from fresh ovarian tumour tissue
and single cell suspensions free of erythrocytes were
prepared (11). Cytological smears were prepared from all
preparations, and samples usually contained > 90% tumour
cells (histopathological characteristics are presented in
Table 1). 2-DE polypeptide patterns obtained from these
cells were analysed by the PDQUEST™ software (7). The
patterns of polypeptide expression in 22 ovarian tumours
were examined: 5 benign (A), 6 borderline (B) and 11
malignant (C) cases (objects). These patterns were matched
together and a reference 2-DE map was constructed
containing 1553 spots (variables).

As an initial step, principal component analysis was
applied to the entire material (22 gels and 1553 spots) to
provide an overview of the data structure and to identify
outliers and possible clusters. Normalized quantities
(expressed as ppm) for all spots were used for the PCA.
Fig. 1 shows the scores for the first two components. A
coarse separation into two major groups, A + B and C, was
observed, indicating that latent structures with predictive
value are present in this set of data. However, the
corresponding loading plots showed very scattered data
(data not shown).

Of the original data (1553 variables, Fig. 1), 170
variables had a substantial influence on the model (PLS;
loadings > 0.02). Approximately 100 variables were active in
separating the groups A + B (benign/borderline) and C
(malignant), and approximately 70 variables in separating
A (benign) from B (borderline). An improved
separation of the clusters representing each of the three
classes was observed using these 170 variables (Fig. 2).
Four significant PLS-DA vectors were found by cross-
validation (Q2 = 0.84), describing 98.4% of the variance in
Y and 40.7% in X. This data set was then closed and
called the "learning set".
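
The selection of influential variables by a loading cut-off may be
sketched as follows. The text gives only the threshold (loadings >
0.02); the use of the absolute value, the rule that a spot is kept if
any component exceeds the threshold, the function name and the
example loadings are all assumptions made for this illustration.

```python
def influential_spots(loadings, threshold=0.02):
    """Return spot identifiers whose largest absolute loading exceeds the threshold.

    `loadings` maps a spot identifier to its loadings, one per PLS component.
    """
    selected = []
    for spot_id, values in loadings.items():
        if max(abs(v) for v in values) > threshold:
            selected.append(spot_id)
    return selected


# Hypothetical loadings for four spots on two components.
example_loadings = {
    "spot_0001": [0.001, -0.004],
    "spot_0002": [0.031, 0.002],
    "spot_0003": [-0.015, -0.027],
    "spot_0004": [0.019, 0.020],
}
selected = influential_spots(example_loadings)
```

Only the spots returned by such a filter would be carried forward
into the reduced data table used for the learning set.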

Testing the model with unknown tumours 

Eighteen new cases were analysed by 2-DE and added to
the existing matchset. Expression levels of the 170 markers
for all cases were analysed blindly using PCA, enabling the
distribution of the new objects to be examined. Figure 3
shows the predictions of unknown cases in a PLS score plot
(and the corresponding loadings in Fig. 4).

After breaking the code, 6 of 8 malignant cases were
correctly classified. Cases 84 and 89 were classified as
borderline. Furthermore, 3 of 4 borderline cases were
correctly classified, whereas borderline case 96 was
classified as benign. Benign cases 90 and 95 were
correctly classified. Of the remaining 4 cases, 3 were
classified as borderline and one (case 29) as
borderline/malignant.
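
The counts reported above can be tallied as in the following sketch.
The true class of the remaining 4 cases is not stated in the text, so
they are counted here as not correctly classified; that treatment is
an assumption, as are the variable names.

```python
# (true class, predicted class) -> number of test cases, as reported in the text.
reported = {
    ("malignant", "malignant"): 6,
    ("malignant", "borderline"): 2,
    ("borderline", "borderline"): 3,
    ("borderline", "benign"): 1,
    ("benign", "benign"): 2,
}
remaining_cases = 4  # true class not stated; assumed not counted as correct

correct = sum(n for (true, predicted), n in reported.items() if true == predicted)
total = sum(reported.values()) + remaining_cases
accuracy = correct / total
```

The tally gives 11 correct classifications out of 18 test cases.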

Testing an ovary model with breast tumours

The possibility that an ovarian cancer model could be
used for classification of intraductal breast tumours was
explored. The present inventors matched the ovary tumour
matchset standard 2-DE map with a corresponding breast
tumour standard map in the database (16). Seventy-five of
the 170 markers were present in the breast standard map.
Fig. 5 shows the PCA distribution of 33 breast cases
(26 carcinomas, 6 fibroadenomas and 1 normal breast
epithelium). Only a tendency towards clustering of benign
cases was observed, which indicates that some but not all of
the markers show predictive value.

3) DISCUSSION 

The present inventors present here a first attempt to
apply artificial learning strategies using quantitative
two-dimensional electrophoresis data for tumour diagnosis. A
subset of the information in the 2-DE pattern, based on 170
spots, was selected. Using these variables, a learning set
was constructed in which an acceptable separation of the
groups benign/borderline/malignant tumours into three
clusters was obtained. Whether other combinations of spots
will result in an improved separation is unknown and
difficult to test, since each learning set has to be tested
by a new panel of unknown samples. We tested the learning
set using 18 cases, and observed a correct classification
of the majority of these (11/18).

It is well known among pathologists that the limited
number of diagnostic sections routinely used may not be
representative of a given lesion. In this context it is
important to note that the sampling technique employed for
2-DE analysis is more likely to meet the requirements for
lesion representativity.

In previous studies by the present inventors, a large
degree of heterogeneity in polypeptide expression was
observed, particularly in malignant tumours (17, 18). Both
qualitative and quantitative differences were found within
each tumour group. In particular, the large quantitative
variability indicated that identification based on pattern
recognition would be difficult. The present data suggest
that it is possible to select a subset of variables which
show limited variability within the group and are useful for
prediction.

Neural networks and artificial learning have been used
to predict cancer prognosis and for grading tumours (5, 19-
22). The parameters used have been various TNM scoring
systems, nuclear grading, tumour markers and
histopathological scoring. For prostate cancer, the
sensitivity of the network was between 81 and 100% and the
specificity between 72 and 75% for predicting various
outcomes such as seminal vesicle and lymph node involvement
(22). Similarly, neural network analysis has been performed
on breast cancer, using parameters such as hormone receptor
status, DNA index, tumour size and number of axillary lymph
nodes involved with tumour as input information (20). These
studies have indicated that artificial learning is a
powerful method to increase the diagnostic accuracy on
individual tumours.

The present inventors have noted that many of the
alterations observed in the 2-DE pattern are similar between
tumours of epithelial origin. Thus similar changes in the
expression of some cytoskeletal and stress proteins are
observed in breast, ovarian and prostate tumours (10; Alaiya
et al., unpublished). With this background, it was
interesting to examine whether a selected set of ovarian
markers could be used for classification of intraductal
breast tumours into benign and malignant. Some clustering of
benign cases was observed, whereas malignant cases showed
extensive scattering. It seems reasonable to suggest that
it will be difficult to construct a universal model for
epithelial tumours, and that learning sets have to be
created for each tumour type.

In conclusion, the present study suggests that
artificial learning strategies can be used for tumour
diagnosis.

Claims

1. A method of obtaining combinations of gene
expression profiles in order to determine diagnostic
markers characteristic of a selected disease type or
stage of development of a disease comprising

(1) obtaining cells from a sample of said disease
tissue;

(2) disrupting cells to expose the cellular products
characteristic of gene expression;

(3) separating said cellular products according to
their characteristic properties on a substrate; and

(4) carrying out computer-assisted multivariate
analysis of the substrate to quantify and characterise
the cellular product distribution on the substrate to
identify specific diagnostic markers characteristic of
said disease.

2. A method according to claim 1 wherein the cellular
products characteristic of gene expression are proteins.

3. A method according to claim 1 or claim 2 wherein the
substrate is an electrophoresis gel which allows
separation of the cellular products characteristic of
gene expression according to their size.

4. A method according to claim 3 wherein said gel is
a 2D-electrophoresis gel.

5. A method according to claim 1 wherein the cellular
products characteristic of gene expression are nucleic
acid sequences.

6. A method according to claim 5 wherein the nucleic
acid sequences are mRNA.

7. A method according to claim 1, claim 5 or claim 6
wherein the substrate comprises a plurality of binding
members capable of binding said cellular products
characteristic of gene expression.

8. A method according to claim 7 wherein said binding
members are oligonucleotides capable of binding said
cellular products characteristic of gene expression
according to their nucleotide sequence.

9. A method according to claim 1 or claim 2 wherein
said binding members are antibodies.

10. A method according to any one of claims 7 to 9
wherein said substrate is a micro-array.

11. A method according to any one of the preceding
claims wherein said cellular products characteristic of
gene expression are labelled to assist computer-assisted
multivariate analysis.

12. A method according to any one of the preceding
claims wherein said multivariate analysis is carried out
by partial least squares discriminant analysis (PLS-DA).

13. A method according to any one of the preceding 
claims wherein the disease is cancer and the cells are 
tumour cells or normal reference cells within a given 
tumour. 



14. A method of creating a collection of diagnostic
markers based on protein expression levels for use in
classifying disease cells in a given sample, comprising

(1) obtaining cells from a plurality of samples of a
selected disease;

(2) disrupting the cells to expose the cellular
proteins contained therein;

(3) separating the cellular proteins on a two-dimensional
electrophoresis gel for each of said
plurality of samples of the selected disease; and

(4) scanning said two-dimensional electrophoresis
gels to collect image data for each of the plurality of
samples of the selected disease.

15. A method of creating a collection of diagnostic
markers based on nucleic acid expression levels for use
in classifying disease cells in a given sample,
comprising

(1) obtaining cells from a plurality of samples of a
selected disease;

(2) disrupting the cells to obtain the expressed
nucleic acid sequences contained therein;

(3) separating the expressed nucleic acid sequences
on a micro-array for each of said plurality of samples of
the selected disease; and

(4) scanning said micro-array to collect image data
for each of the plurality of samples of the selected
disease.

16. A method according to claim 14 or claim 15 further 
comprising the step of analysing said image data in order 
to identify one or more markers characteristic of said 
selected disease. 

17. A method of determining the presence, type or stage
of a disease in a patient comprising the steps of

(1) extracting a sample of candidate disease cells
from the patient;

(2) disrupting the cells so as to expose the
cellular proteins contained therein;

(3) separating the cellular proteins on a two-dimensional
electrophoresis gel; and

(4) analysing said gel by computer-assisted image
evaluation so as to compare protein distribution on the gel
with a database of diagnostic markers characteristic of a
plurality of tumour types or stages of malignancy to
determine presence, type or risk of said disease in said
patient.

18. A method of determining the presence, type or stage
of a disease in a patient comprising the steps of

(1) extracting a sample of candidate disease cells
from the patient;

(2) disrupting the cells so as to obtain the
expressed nucleic acid sequences contained therein;

(3) separating the expressed nucleic acid sequences
on a micro-array according to their individual nucleotide
sequence; and

(4) analysing said micro-array by computer-assisted
image evaluation so as to compare expressed nucleic acid
distribution on said micro-array with a database of
diagnostic markers characteristic of a plurality of
disease types or stages of development of said disease to
determine presence, type or risk of said disease in said
patient.

19. A method according to any one of the preceding
claims wherein the number of markers characteristic of
said disease type is in the range of 20 to 500.

20. A method according to claim 19 wherein the number of
markers characteristic of said disease type is in the
range of 50 to 300.

21. A method according to any one of claims 14 to 20
wherein the disease type is selected from the group
cancer, hypo/hyperthyroidism, diabetes, organ rejection,
and samples for leukaemia or other haematopoietic
disorders.

22. A method according to claim 21 wherein said disease
state is cancer and said disease tissue is a tumour.

23. A protein expression profile database comprising
image data which has been analysed in order to determine
a plurality of variables for use as diagnostic markers;
said data being obtained from analysis of two-dimensional
electrophoresis gels showing characteristic protein
distribution associated with disease type and state of
disease for use in disease diagnosis.

24. A protein expression profile database according to
claim 23 wherein said disease is cancer and the state of
said disease equates to the state of malignancy of said
cancer.

25. A nucleic acid expression profile database
comprising image data which has been analysed in order to
determine a plurality of variables for use as diagnostic
markers; said data being obtained from analysis of a
micro-array showing characteristic expressed nucleic acid
distribution associated with disease type and state of
disease for use in disease diagnosis.

26. A nucleic acid expression profile database according
to claim 25 wherein said disease is cancer and the state
of said disease equates to the state of malignancy of
said cancer.

27. A nucleic acid expression profile database according
to claim 25 or claim 26 wherein the expressed nucleic
acid is mRNA or cDNA.

Table 1: Histopathological characteristics of samples



Serial  Case    Learning     Test cases: predicted   True type
No.     No.     model label  result from PLS-DA      (A/B/C)

1       UU14    A1                                   A
2               A2                                   A
3
4               A4
5
6
7               B1
8               B2
9       OC50    B3
10      OC21    B4
11      OC59    B5                                   B
12      OC68    B6                                   B
13      OC72
14      OC77
15      OC07    C1                                   C
16      OC08    C2
17      OC09    C3                                   C
18      OC20    C4                                   C
19      OC30    C6                                   C
20      OC40    C7                                   C
21      OC43    C8                                   C
22      OC04    C9
23      OC06    C10
24      OC27    C11
25      OC33    C12
26      OC48                 C
27      OC45
28      OC90
29      OC96                 A
30      OC49
31      OC84                                         C
32      OC74
33      OC73
34      OC89                 B/C
35      OC95
36      OC29
37      OC66
38      OC35
39      OC111                C                       C
40      OC117                B                       B


Pathological Diagnosis

Serous Cystadenoma IA
Serous Cystadenoma IA
Serous Cystadenoma IA
Serous Cystadenoma IA
Mucinous Cystadenoma IIA
Cystadenofibroma
Borderline Seropapillary IB
Borderline Seropapillary IB
Borderline Seropapillary IB
Borderline Mucinous IIB
Borderline Mucinous IIB
Borderline Mucinous IIB
Borderline Serous
Borderline Serous
Sero Papillary ADC (IC)
Sero Papillary ADC (IC)
Sero Papillary ADC (IC)
Seropapillary IC
Bil Seropapillary IC
Bil Adenocarcinoma
Bil Seropapillary IC
Mixed tumor
Clear Cell tumor (IVC)
Clear Cell tumor (IVC)
Endometrioid Ca IIIC
Sero Papillary IC
Endometrioid Ca IIIC
Serous Cystadenofibroma
Borderline Serous